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Abstract —Generalized Linear Models (GLMs), where a ran¬ 
dom vector X is observed through a noisy, possibly nonlinear, 
function of a linear transform z = Ax, arise in a range of 
applications in nonlinear filtering and regression. Approximate 
Message Passing (AMP) methods, based on loopy belief prop¬ 
agation, are a promising class of approaches for approximate 
inference in these models. AMP methods are computationally 
simple, general, and admit precise analyses with testable condi¬ 
tions for optimality for large i.i.d. transforms A. However, the 
algorithms can diverge for general A. This paper presents a 
convergent approach to the generalized AMP (GAMP) algorithm 
based on direct minimization of a large-system limit approx¬ 
imation of the Bethe Free Energy (LSL-BFE). The proposed 
method uses a double-loop procedure, where the outer loop 
successively linearizes the LSL-BFE and the inner loop minimizes 
the linearized LSL-BFE using the Alternating Direction Method 
of Multipliers (ADMM). The proposed method, called ADMM- 
GAMP, is similar in structure to the original GAMP method, but 
with an additional least-squares minimization. It is shown that for 
strictly convex, smooth penalties, ADMM-GAMP is guaranteed 
to converge to a local minimum of the LSL-BFE, thus providing 
a convergent alternative to GAMP that is stable under arbitrary 
transforms. Simulations are also presented that demonstrate the 
robustness of the method for non-convex penalties as well. 

Index Terms —Belief propagation, ADMM, variational opti¬ 
mization, message passing, generalized linear models. 

I. Introduction 

Consider the problem of estimating a random vector x S K" 
from observations y G M"* as shown in Fig. The un¬ 
known vector is assumed to have a prior density of the form 
p(x) = and the observations y G K™ are described 

by a likelihood function of the form p(y|x) = for 

some known transform A G In statistics, this model 

S. Rangan (email: srangan@poly.edu) is with the Department of Electrical 
and Computer Engineering, Polytechnic Institute of New York University, 
Brooklyn, NY. His work was supported by the National Science Foundation 
under Grant No. 1116589. 

A. K. Fletcher (email: akfletcher@ucla.edu) is with the Departments of 
Statistics, Mathematics, and Electrical Engineering, University of California, 
Los Angeles. 

P. Schniter (email: schniter@ece.osu.edu) is with the Department of Elec¬ 
trical and Computer Engineering, The Ohio State University. His work was 
supported in part by the National Science Foundation Grants CCF-1218754, 
CCF-1018368, and CCF-1527162. 

U. S. Kamilov (email: ulugbek.kamilov@epfl.ch) is with the Biomedical 
Imaging Group, Ecole polytechnique federale de Lausanne (EPFL), CH-1015 
Lausanne VD, Switzerland. His work was supported by the European Re¬ 
search Council under the European Union’s Seventh Framework Programme 
(FP7/2007-2013)/ERC Grant Agreement 267439. 

This work was presented in part at the 2015 IEEE Symposium on Infor¬ 
mation Theory. 


is a special case of a generalized linear model (GLM) |jTj, 
Q and arises in a range of applications including statistical 
regression, hltering, inverse problems, and nonlinear forms of 
compressed sensing. The posterior density of x given y in the 
GLM model is given by 

Fx|y(x|y) = ^^exp[-/^(x) - /^(Ax,y)], (1) 

where E(y) is a normalization constant. In the sequel, we will 
often omit the dependence on y and simply write 

Px|y(x|y) = ^ exp [-f:r(x) - f^(Ax)] , (2) 

SO that the dependence on y in the function /z( ) and the 

normalization constant Z is implicit. In this work, we consider 
the inference problem of estimating the posterior marginal 
distributions, Pxj\y{xj\y). From these posterior marginals, one 
can compute the posterior means and variances 

Xj=E{xj\y), (3a) 

=vaT{xj\y). (3b) 

We study this inference problem in the case where the func¬ 
tions /a; and fz are separable, in that they are of the form 

n 

fx{^) ='^fx,{xj), (4a) 

m 

fz{z) ='^fzM), (4b) 

i=l 

for some scalar functions fx and /^.. The separability as¬ 
sumption ( |4a| l corresponds to the components in x being 
a priori independent. Recalling the implicit dependence of 
fz on y, the separability assumption (14§ corresponds to 
the observations y being conditionally independent given the 
transform outputs z = Ax. 

For posterior densities of the form Q, there are several 
computationally efficient methods to hnd the maximum a 
posteriori (MAP) estimate, which is given by 

X = argmaxpx|y(x|y) = argmin [^(x) -f /z(Ax)]. (5) 

X X 

Under the separability assumptions Q, the MAP minimization 
0 admits a factorizable dual decomposition that can be 
exploited by a variety of approaches, including variants of the 
iterative shrinkage and thresholding algorithm (ISTA) Q-Q 
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Fig. 1. Generalized Lineai' Model (GLM) where an unknown random vector 
X is observed via a linear transform followed by componentwise likelihood 
to yield a measurement vector y. 


and the alternating direction method of multipliers (ADMM) 

©-GD- 

In contrast, the inference problem of estimating the posterior 
marginals p{xj\y) and the corresponding minimum mean 
squared error (MMSE) estimates ( [3al l is often more difficult— 
even in the case when and are convex. As a simple 
example, consider the case where /x(x) = 0 and each fzi{zi) 
constrains Zi to belong to some interval, so that /^(Ax) 
constrains x to belong to some polytope. The MAP estimate 
0 is then given by any point in the polytope. Such a point can 
can be computed via a linear program. However, the MMSE 
estimate ( |^ is the centroid of the polytope which is, in 
general, #P-hard to compute GD- 

GLM inference methods often use a penalized quasi¬ 
likelihood method or some form of Gibbs sampling GD, 
G§. In recent years, Bayesian forms of approximate message 
passing (AMP) have been considered as a potential alternate 
class of methods for approximate inference in GLMs GD- 
p^ . AMP methods are based on Gaussian and quadratic 
approximations to loopy belief propagation (loopy BP) in 
graphical models and are both computationally simple and 
applicable to arbitrary separable penalty functions and /^. 
In addition, for certain large i.i.d. transforms A, they have 
the beneht that the behavior of the algorithm can be exactly 
predicted by a state evolution analysis, which then provides 
testable conditions for Bayes optimality |[22)-| 241. 

Unfortunately, for general A, AMP methods may diverge 
1^, p6) —a situation that is not surprising given that AMP 
is based on loopy BP, which also may diverge. Several recent 
modihcations have been proposed to improve the stability of 
AMP, including damping p5) , sequential updating | |27) , and 
adaptive damping | |28) . However, while these methods appear 
to perform well empirically, little has been proven rigorously 
about their convergence. 

The main goal of this paper is to provide a provably con¬ 
vergent approach to AMP. We focus on the generalized AMP 
(GAMP) method of p^ , which allows arbitrary separable 
functions for both and /^. Our approach to stabilizing 
GAMP is based on reconsidering the inference problem as a 
type of free-energy minimization. Specihcally, it is known that 
GAMP can be considered as an iterative procedure for min¬ 
imizing a large-system-limit approximation of the so-called 
Bethe Tree Energy (BEE) p9) , pO) , which we abbreviate as 
“LSL-BEE” in the sequel. The BEE also plays a central role 
in loopy B P |[3T] , and we review both the BEE and LSL-BEE 
in Section Hill 

In contrast to GAMP, which implicitly minimizes the LSL- 
BEE through an approximation of the sum-product algorithm. 


our proposed approach explicitly minimizes the LSL-BEE. We 
propose a double-loop algorithm, similar to the well-known 
Convex Concave Procedure (CCCP) p^ . Specihcally, the 
outer loop of our method successively approximates the LSL- 
BEE by partially linearizing the LSL-BEE around the current 
belief estimate, while the inner loop minimizes the linearized 
LSL-BEE using ADMM Q. Similar applications of ADMM 
have also been proposed for related free-energy minimizations 
0, pg. Interestingly, our proposed double-loop algorithm, 
which we dub ADMM-GAMP, is similar in structure to the 
original GAMP method of p2) , but with an additional least 
squares optimization. We discuss these differences in detail in 
Section |Vm] 

Our main theoretical result shows that, for strictly convex 
penalties, the proposed ADMM-GAMP algorithm is guaran¬ 
teed to converge to at least a local minimum of the LSL-BEE. 
In this way, we obtain a variant of the GAMP method with 
a provable convergence guarantee for arbitrary transforms A. 
In addition, using hardening arguments similar to p5) , p^ , 
we show that our ADMM-GAMP can also be applied to the 
MAP estimation problem, in which case we can obtain global 
convergence for strictly convex, smooth penalties. Also, while 
our theory requires convex penalties, we present simulations 
that show robust behavior even in non-convex cases. 


H. The GLM and Examples 

Before describing our optimization approach, it is useful to 
briefly provide some examples of the model © to illustrate 
the generality of the framework. As a first simple example, 
consider a simple linear model 

y = Ax -f e,, (6) 

where A is a known matrix, x is an unknown vector and e is 
a noise vector. In statistics, A would be the data matrix with 
predictors, x would be the vector of regression coefficients, y 
the vector of target or response variables and e would represent 
the model errors. To place this model in the framework of this 
paper, we must impose a prior p(x) on x and model the noise 
e as a random vector independent of A and x. Under these 
assumptions, the posterior density of x given y will be of the 
form G) if we define 

^(x) := -lnp(x), /^(z) := -lnp(y|z) = -lnp,(y-z). 

(7) 

The separability assumption 0 will be valid if the components 
of Xj are Wi are independent so the prior and noise density 
factorizes as 


j=i i=i 

If the output noise e is Gaussian with independent components 
a ^ A/'(0,(Tj), the output factor fz{z) in 0 has a quadratic 
cost, 

/^(z) = 
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Similarly, if x has a Gaussian prior with Af{0, cr^I), the input 
factor /a;(x) will be given by 

/x(x) = 

Note that the estimation in this quadratic case would be given 
by standard least squares estimation. 

However, much more general models are possible. For 
example, for Bayesian forms of compressed sensing prob¬ 
lems pT) , one can impose a sparse prior p(x) such as a 
Bemoulli-Gaussian or a heavy-tailed density. 

Also, for the output, any likelihood p(y|z) that factorizes 
as Y\iP{yi\zi) can be incorporated. This model would occur, 
for example, under any output nonlinearities as considered in 



Vz = Aizi) + G, 

where (j)i{zi) is a known, nonlinear function and is noise. 
The model can also include logistic regression where G 
{0,1} is a binary class variable and 

P(y^ = l\zi) = 1 - P{yi = 0\zi) = cr(zi), 

for some sigmoidal function cr(z). One-bit and quantized 
compressed sensing 0 as well as Poisson output models 
ED can also be easily modeled. 

III. Bethe Free Energy Minimization 
We next provide a brief review of the Bethe Free Energy 
(BEE) minimization approach to estimation of marginal densi¬ 
ties in GLMs. A more complete treatment of this topic, along 
with related ideas in variational inference, can be found in 

mj, g2). 

Eor a generic density p(x|y), exact computation of the 
marginal densities p{xj\y) is difficult, because it involves a 
potentially high-dimensional integration. BEE minimization 
provides an approximate approach to marginal density compu¬ 
tation for the case when the joint density admits a factorizable 
structure of the form 

L 

p(x|y) cx |y), (8) 

i=i 

where, for each £, is a sub-vector of x created from 

indices in the subset a{£) and tpi is a potential function on that 
sub-vector. In this case, BEE minimization aims to compute 
the vectors of densities 

b= [6i,...,6n]^ and q = [gi, • • ■, 

where hj{xj) represents an estimate of the marginal density 
p{xj\y) and where g^(xQ,(^)) represents an estimate of the 
joint density p(xQ,(^)|y) on the sub-vector x^(^iy These den¬ 
sity estimates, often called “beliefs,” are computed using an 
optimization of the form 

(b, q) = arg min J(b, q), (9) 

(b,q)G£; 

where J(b, q) is the BEE given by 

L n 

•/(b,q) = '^D{qe\\i;e) + '^{nj - (10) 

e=i j=i 


where is the KL divergence. 


D{a\\b) 4 


a(x )In 


a{x) 

b{x)_ 


dx; 


( 11 ) 


where H{bj) is the entropy or differential entropy; and where 
(for each j) nj is the number of factors £ such that j G a{£). 
The BEE minimization is performed over the set E of 
all (b, q) whose components satisfy a particular “matching” 
condition: for each j G a{£), the marginal density of Xj within 
the belief qt{x^(^i^) must agree with the belief bj{xj). That is, 
the set E contains all (b, q) such that 


J qd^a{e)) ^^a{e)\j = bj(xj), for all £,j, ( 12 ) 


where the integration is over the components in the sub-vector 
Xq,(^) holding Xj constant. Note that E imposes a set of linear 
constraints on the belief vectors b and q. 

The BEE minimization exactly recovers the true marginals 
in certain cases (e.g., when the factor graph has no cycles) and 
provides good estimates in many other scenarios as well; see 
| |42| for a complete discussion. In addition, due to its separable 
structure, the BEE can be typically minimized “locally,” by 
solving a set of minimizations over the densities b and q. 
When the cardinalities of the subsets a{£) are small, these 
local minimizations may involve much less computation than 
directly calculating the marginals of the full joint density 
p(x|y). In fact, the classic result of |31| is that loopy belief 
propagation can be interpreted precisely as one type of iterative 
local minimization of the BEE. 

Eor the GEM in Section]^ the separability assumption (41 
allows us to write the density (|^ in the factorized form (81 
using the L = n + m potentials 


-ipjixj) = exp{-f^.{xj)), j = l,...,n, (13a) 

V'n+i(x) = exp(-/^.(ajx)), i = (13b) 

where aj is the i-th row of A. Note that, if A is a non- 
sparse matrix, then /^.{ajx) depends on all components in 
the vector x. In this case, the application of traditional loopy 
BP—as described for example in | [43) —does not generally 
yield a significant computational improvement. 

The GAMP algorithm from p2) can be seen as an ap¬ 
proximate BEE minimization method for GLMs with possibly 


dense transforms A. Specifically, it was shown in |29| that the 


stationary points of GAMP coincide with the local minima of 
the constrained optimization 


{bx,bz) = argminJ( 62 ^, 62 ) such that (14a) 
E(z|5^) = AE(x| 6 a;) (14b) 

where b^ and b^ are product densities, i.e.. 


bxix) = Ylbx^(xj), b;,{z) =Y[b^,{zi), (15) 

i=i 


2=1 
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and the objective function J{bx^hz) is given by 

+ H{ystr{x.\b^),ysr{z\b:,)), (16) 

Tzj 

'rx = T^J ^var{xj\b^-), (18) 

■Tz = = var(z,| 6 ^,), (19) 

S., = [S],^. 4 y^,J. (20) 

Above, and in the sequel, we use E(x|&a;) G M” to denote the 
expectation of x under x ^ bx, and we use var(x|(i 2 ;) G M" to 
denote the vector whose jth entry is the variance of Xj under 
X ~ bx- Note that var(x|& 2 ;) is not a full covariance matrix. 
Also, is the scale factor that makes 

a valid density over z G K™. Although it is not 
essential for this paper, we note that i/(var(x|&a;), var(z|& 2 )) 
is an upper bound on the differential entropy of b^ that is 
tight when bz has independent Gaussian entries with variances 
Tz = Sto;. It was then shown in pO) that the objective function 
in ( [Tbl l can be interpreted as an approximation of the BFE 
for the GLM from Section |I] in a certain large-system limit, 
where m,n —>■ oo and A has i.i.d. entries. We thus call the 
approximate BFE in ( [TSl l the large-system limit Bethe Free 
Energy or LSL-BFE. 

Similar to the case of loopy BP, it has been shown in | [29) , 
ID that the stationary points of ( |T4| ) are precisely the fixed 
points of sum-product GAMP. Thus, GAMP can be interpreted 
as an iterative procedure to find local minima of the LSL-BFE, 
much in the same way that loopy BP can be interpreted as 
an iterative way to find local minima of the BFE. The trouble 
with GAMP, however, is that it does not always converge (see, 
e.g., the negative results in | |25] , | [2^ , p 8 |). The situation is 
similar to the case of loopy BP. Although several modifications 
of GAMP have been proposed with the goal of improving 
convergence, such as damping p5) , sequential updating HZ)’ 
and adaptive damping p 8 ) , a globally convergent GAMP 
modification remains elusive. 

IV. Minimization via Iterative Linearization 

Our approach to finding a convergent algorithm for mini¬ 
mizing the constrained LSL-BFE employs a generalization of 
the convex-concave procedure (CCCP) of that we will 
refer to as Minimization via Iterative Linearization. 


In 


27r ^ ^ SijTxj 


(17) 



A. The Convex-Concave Procedure 

We first briefly review the CCCP. Observe that, in the BFE 
the D{qi\\il}i) terms are convex in qi and the H{bj) terms 
are concave in bj. Thus, the BFE ( [T0| ) can be written as a sum 
of terms 

>/(b,q) = /(q) -b h{h), 

where / is convex and h is concave. The CCCP finds a 
sequence of estimates of a BFE minimizer (b, q) by iteratively 


linearizing the concave part of this function, i.e.. 


(b'",q'=) = argmin/(b) -b ( 7 '=)'^q, 
(b.q)ei5 


7 


A;+l _ 


dh{q'^) 

dq 


( 21 a) 

( 21 b) 


where dh{q^)/dq denotes the gradient of h at q^. The 
resulting procedure is often called a “double-loop” algorithm, 
since each iteration involves a minimization ( | 21 a| ) that is itself 
usually performed by an iterative procedure. Because / is 
convex and the constraint (b, q) S i? is linear, the mini¬ 
mization problem ( |21a| ) is convex. Thus, the CCCP converts 
the non-convex BFE minimization to a sequence of convex 
minimizations. In fact, it can be shown that the CCCP will 
monotonically decrease the BFE for arbitrary convex / and 
concave h [3^. 


B. Minimization via Iterative Linearization 

For the LSL-BFE, it is not convenient to decompose the 
objective function into a convex term plus a concave term. 
To handle problems like LSL-BFE minimization, we consider 
optimization problems of the form 

minJ(b), J(b) =/(b)-b/i(g(b)), (22) 

h^B 

where now b is a vector in a Hilbert space Hb, B is a closed 
affine subspace of Hb, f ■ Hb —>■ K is a convex functional, 
g : Tffe —>■ RP is a mapping from Hb to for some p G N, 
and : RP —)■ R is an arbitrary functional. Below, we use 
X € R^ to denote the input to h. Note that the functionals h 
and /i(g(-)) may be neither concave nor convex. 

To solve ([22|, we propose the iterative procedure shown 
in Algorithm [T] which is reminiscent of the CCCP. At each 
iteration k, an estimate b^ of argminj^g^ JO^) is computed 
by minimizing the functional 

J(b,7'=)4/(b) + (7'=)Tg(b), (23) 

where 7 ^ G R^ is a “damped” version of the gradient 
dh(g(b))/db. In particular, when the damping parameter 9^ 
is set to unity, the linearization vector is exactly equal to the 
gradient at b^, i.e., 7 ^ = d[h{g{b’‘))]/db, similar to CCCP. 
However, in AIgorithm[^ we have the option of setting 6^ < 1, 
which has the effect of slowing the update on 7 ^. We will see 
that, by setting < 1 , we can guarantee convergence when 
h and/or /i(g(-)) is non-concave. 


C. Convergence of Minimization via Iterative Linearization 

Observe that when / is convex, h is concave, Hb = R^ 
(as when xj are discrete variables), g is the identity map 
(i.e., g(b) = b), and there is no damping (i.e., 9^ = 1 Vfc), 
Algorithm [T]reduces to the CCCP. However, we are interested 
in possibly non-concave /i(g(-)), in which case we cannot 
directly apply the results of p2) . We instead consider the 
following alternate conditions. 

Assumption 1: Consider the optimization problem ( |2^ , and 
suppose that the functions /, g, and h have components 
that are twice differentiable with uniformly bounded second 
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Algorithm 1 Minimization via Iterative Linearization 
Require: Optimization problem ( |2^ . 

1: fc ^ 0 

2 : Select initial linearization 7 ° 

3: repeat 

4: {Minimize the linearized function} 

5: ^ argmin^gB J(b, 7 '=) 

6 : 

7: {Update the linearization} 

8 : Select a damping parameter 6^ G (0,1] 

9: r'" ^ g(b'") 

10 : >yfc+l ^ (1 - 61 ^) 7 *^ + ^ 

OT 

11 : until Terminated 


derivatives. Also, assume that there exists a convex set T such 
that, for all 7 e T: 

(a) The minimization of the linearized function, 

b( 7 ) = argmin J(b, 7 ) (24) 

heB 


exists and is unique. 

(b) At each minimum, the linearized objective is uniformly 
strictly convex in the linear space B in that there exists 
constants ci, C 2 with C 2 > Ci > 0 such that 


Ci||u|p < u'''H( 7 )u < C 2 |lu|p, Vu : 6 ( 7 ) + uG B, 

J25) 

where H( 7 ) is the Hessian of J with respect to b at b( 7 ), 
i.e., 


H(7) = 


9V(b,7) 

Ob i9b^ 


b=b(-y) 


(26) 


and where the constants ci and C 2 do not depend on 7 . 
(c) The gradient obeys dh{^{h))/dT G T when b = b( 7 ). 


Theorem 1: Consider Algorithm under Assumption 
There exists a 0 G (0,1) such that if the damping parameters 
are selected with {) < 9^ < 0 for all fc, and if the initialization 
obeys 7 ° G T, then 7 ^ G T for all k and the objective 
monotonically decreases, i.e., 

J(b'=+i) < J(b'=) Vfc. (27) 

Proof: See Appendix [A| ■ 

The most simple case where Assumption holds is the 
setting where /(b) is strictly convex and smooth, g{h) is 
linear and /i(t) is smooth (but neither necessarily convex 
nor concave). Under these assumptions, J(b, 7 ) would be 
strictly convex for all 7 , thereby satisfying Assumptions (a) 
and (b). The assumption would also be valid for strictly convex 
/(b) and convex p(b), provided we restrict to positive 7 . In 
this case, to satisfy assumption (c), we would require that 
dh{g{h))/dT > 0, i.e. h{T) is increasing in each of its 
component. Interestingly, in the setting we will use below, 
/(b) will be convex, but g{h) will be concave. Nevertheless, 
we will show that the assumption will be satisfied. 


D. Application to LSL-BFE Minimization 

To apply Algorithm [T] to the LSL-BFE minimization ( [T4l l, 
we first take B to be the vector of separable density pairs 
b = {hx',bz) satisfying the moment matching constraint 

B = {{bx;b,) \E{z\b,) = AE{^\bx)} . (28) 


Then, if we define the functions 

/(b) 4 flbxX) = D{bx\\e-f-) + D{b,\\Zf^e-f^) (29a) 

g(b)''' = [var(x| 6 ^);var(z| 6 ^)] = (29b) 

fi(lTx;T;,J) = H(tx,t^), ( 29 c ) 

we see that J{bx,bz) from (HI can be cast into the form in 
( |22l ). Observe that, while / is convex, the function /i(g( )) is, 
in general, neither convex nor concave. Thus, while the CCCP 
does not apply, we can apply the iterative linearization method 
from Algorithmic 

We will partition the linearization vector 7 conformally with 
function g in ( |29b| i as 

7 = [ 1 ./( 2 t ,); 1 ./( 2 tp )], ( 30 ) 


where we use “./” to denote componentwise division of two 
vectors and to denote vertical concatenation. The notation 
in ( |30| ) will help to clarify the connections to the original 
GAMP algorithm. Using the above notation, the linearized 
objective ( | 2 ^ can be written as 

J{bx,b^,Tr,Tp) = D{bx\\e~^’^) + D{bz\\Zf'^e~^‘) 

+ {l./{2Tr)f V2Z[{-x\bx) 

+ {l./{2Tp)f \dii{z\bf). (31) 

Finally, we compute the gradient h' = ^ of the function 
h from ( |29c| i. Similar to 7 , we will partition the gradient into 
two terms. 


_ , A dH{Tx,T^ 


l./( 2 x,) ^ 


^ , A dH{Tx,T^) 


l.l{2Tp) ^ 


OTx ’ ■' ' P Qt^ 

From ( [T7 ] i, the derivative of H with respect to is 
1 dH{Tx,T^) 1 


2 r„ 


dr. 


2 Sj = l SijTx- 


(32) 


(33) 


Similarly, using the chain rule and ( [3^ , we find 

1 _dH{Tx,T^)_^ dH{Tx,T^) d{J2kSikTxk) 


2 t. 


QTx 

1 


E ( 




^{'LkS^kTx^) 

L., 


dTx 


2'KTr, 


1 


^ ^ • Pi 

We can then write < [33| ) and < (34| ) in vector form as 

Tp = STx, l./Tr = s''' [(1 - T^.jTp) ./Xp] . 


(34) 


(35) 


Substituting the above computations into the iterative lin¬ 
earization algorithm. Algorithm [C we obtain Algorithmic We 
refer to this as the outer loop, since each iteration involves a 
minimization of the linearized LSL-BFE in line |5] We discuss 
this latter minimization next and show that it can itself be 
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performed iteratively using a set of iterations that we will refer 
to as the inner loop. 

We will also show shortly that, under certain convexity 
conditions, the conditions of Assumption [T] are satished, so 
that Algorithm will converge to a local minimum of the 
LSL-BFE. 


Algorithm 2 Minimizing LSL-BFE via Iterative Linearization 


Require: LSL-BFE objective function (16 1 with a matrix A. 
1 : 


Select initial linearization r®, t°. 

repeat 

{Minimize the linearized LSL-BFE} 
(btbz) ^ argmim,, ft 


\b^,b,)eB ■ 

{Compute the gradient terms} 
r* ^ var(x| 6 ^), ^ var(z|&^) 


r 1 p J 


rf ^l./(S'r{=) 


P 


{Update the linearization} 

Select a damping parameter 9^ G (0,1] 
1./t/=+i ^ 9^1./t^ -f (1 - 6l'=)l./r/= 
^ 9^1./t^ + (1 - 9’=)1./t^ 


until Terminated 


E. Alternative Methods 

While the method proposed in this paper is based on CCCP 
of l l^ , there are other methods for direct minimization of the 
BFE that may apply to the LSL-BFE as well. For example, for 
problems with binary variables and pairwise penalty functions, 
| [45) propose a clever re-parametrization to convert the 
constrained BFE minimization to an unconstrained optimiza¬ 
tion on which gradient descent can be used. Unfortunately, 
it is not obvious if the LSL-BFE here can admit such a re- 
parametrization since the penalty functions are not pairwise 
and the variables are not binary. 

V. Inner-Loop Minimization and ADMM-GAMP 
A. ADMM Principle 

The outer loop algorithm. Algorithm]^ requires that in each 
iteration we solve a constrained optimization of the form 

{ba:,bz) = argmin s.t. E{z\bz) = AE(x|6^). 

(36) 

We will show that this optimization can be performed by the 
Alternating Direction Method of Multipliers (ADMM) 0. 
ADMM is a general approach to constrained optimizations 
of the form 

min/(w) s.t. Bw = 0, (37) 

W 

where /(w) is an objective function and B is some constraint 
matrix. Corresponding to this optimization, let us dehne the 
augmented Lagrangian 

L(w, u; t) = /(w) + u'''Bw -f i||Bw||^, (38) 


where u is a dual vector, r is a vector of positive weights 
and ||x ||2 ^ Tj. The ADMM procedure then produces 

a sequence of estimates for the optimization ( [37] i through the 
iterations 

= argminL(w, u*; t) (39a) 

W 

= u‘ -f Diag(l./x)Bw*+\ (39b) 

where Diag(d) creates a diagonal matrix from the vector d. 
The algorithm thus alternately updates the primal variables w* 
and dual variables uL The vector t can be interpreted as a 
step-size on the primal problem and an inverse step-size on 
the dual problem. 

The key beneht of the ADMM method is that, for any 
positive step-size vector r, the procedure is guaranteed to 
converge to a global optimum for convex functions /(w) 
under mild conditions on B. 


B. Application of ADMM to LSL-BFE Optimization 

The ADMM procedure can be applied to the linearized 
LSL-BFE optimization ( |3^ as follows. First, we replace 
the constraint E(z| 62 ) = AE(x| 6 a;) with two constraints; 
E(z| 62 ) = Av and E(x| 6 a;) = v. Variable splittings of this 
form are commonly used in the context of monotropic pro¬ 
gramming l|4§. With this splitting, the augmented Lagrangian 
for the LSL-BFE ( [T^ becomes 


L(&a:,& 2 ,s,q,v;Tp,Xr) 

= J{bx,bz,Tr.,Tp) -t- q'''(E(x|& 2 .) - v) -f s'^(E(z| 6 ^) - Av) 

+ ^liE(x| 6 ^) - v|j2^ -f ^|lE(z| 6 ^) - Aw\y, (40) 


where s and q represent the dual variables. Note that the 
vectors and Tp that appear in the linearized LSL-BFE 
J{bx,bz,Tr,Tp) have been used for the augmentation terms 
(i.e., the last two terms) in ( |40| ). This choice will be critical. 
From ([39]l, the resulting ADMM recursion becomes 


{by,by) = a.TgmmL{bx,bz,s*,qfv^;Tp,Tr), (41a) 

b^,b^ 

s*+i = s* -f Diag(l./xp)(E(z| 6 {+i) - Av‘+i), (41b) 

q‘+i = q‘ + Diag(l./T,)(E(x|e') - v*+'), (41c) 

= arg min L{by , bf^, s*+^, q*+\ v; Xp, r^). (41d) 

V 


To compute the minimization in ( |41a| i, we hrst note that the 
second and fourth terms in ( |40| can be rewritten as 

q^(E(x|6,.) - v) + }|E(x|6,.) - v||^_, 

= : ■ivinxi\K) ^ 


i=i 

n 

- + 
i=i 

n 

= E 


2 tj . 


x] - 2vjXj 

b \ 

2 Tr- 

V 2-., 

j - 

^x) '^Xj 

2 Tr 

^3 

2tj.. 

'3 


n 

■ 1 ^ ‘ rj 


-f const 


const 
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where in (a) we used 'E'^{xj\bx) = V.{x'j\bx) — vsx{xj\bx) = 
^{x'j\bx) — Tx/, in (b) we used to denote componentwise 
multiplication between vectors; and “const” includes terms 
that are constant with respect to bx and bz- A similar devel¬ 
opment yields 

s'^(E(z|&J - Av) + ^||E(z|6^) - Av||2^ (43) 

i—1 

Also, note that the last two terms in ([31]) can be rewritten as 


n 


(l./(2x^))'^var(x|0^) = ^ 

7 = 1 

(44a) 

m 

(l./(2xp))'^var(z|6J = ^ 

1 — 1 ^ 

(44b) 


Substituting ( [3T] l, ( |42| ), ( |4^ , and ( |44| ) into ( |40j l, and canceling 
terms, we get 


L{bx,bz,S,C[,V,Tp,Tr) 

= D{bx\\e-f^) + ^Edlx- (v-Tr.q)||2j6^) 

+ DibzWZ-^e-^^^) + ^Edlz - (Av - Tp.s)\\ljbz) 
+ const 

~ X" exp(-/^(x)-i||x-(v~T,..q)|| 2 ^)^^ 


6 z(z) In 


__ 

exp(-/,(z)-i||z-(Av-Tp 


rdz 


+ const (45) 

= D{bx\\px) + D{bz\\pz) + const, (46) 

forp:r(x) (X exp(-/a;(x)-i||x-(v-Tp.q)||2j andp^(z) oc 
exp(—/^(z)— ^Ijz— (Av —Xp.s)||^^). Therefore, the ADMM 
step ( |41a| ) has the solution, 

6 ^+^(x) cx exp ( - fx{x) - 5 l|x - r*\\lj (47a) 

b*+\z) (X exp ( - fz{z) - lllz - pdl^p). (47b) 


for vectors 


r* = v‘ - Tp.q‘ (48a) 

p* = Av‘-Tp.sd (48b) 

where we use to denote componentwise vector multiplica¬ 
tion. Using Bayes rule, ( |47al i can be interpreted as the posterior 
density of the random vector x under the prior e“A(x) 
independent Gaussian likelihood with mean r* and variance 
Xp. Similarly, ( |47b| l can be interpreted as the posterior pdf 
the random vector z under the likelihood e“A(z) and an 
independent Gaussian prior with mean p* and variance Tp. 

To tackle the minimization ( |41d| l, we ignore the v-invariant 
components in the original augmented Lagrangian ( |40| i, after 
which ( |41d| i can be reformulated as the least-squares problem 

v*+^ = argmin ||z‘+^ + — Av||^ 

V ^ 

-f ||x‘+i-f Tpq‘+i - vll^^ (49) 

using the definitions 

z'+i ^ E(z|6fi), x‘+i ^ E(x|6^+i). (50) 


Algorithm 3 ADMM-GAMP 


Require: Matrix A, estimation functions px and pz 
1 
2 

3 

4 

5 

6 

7 

8 
9 

10 
11 
12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 


S A.A (componentwise square) 
Initialize > 0,Tp > 0, v° 
qO ^ 0 , s° ^ 0 
0 

repeat 

{ADMM inner iteration} 
r* ^ V* — Tp.q* 
p‘ ^ Av* - T*.s* 

x*+^ ^ Pxir^T*), z*+^ ^ 5z(p‘, ) 

qt+i ^ qt _|_ Diag(l./rp)(x*+^ — v*) 
s*+^ •(— s* -f Diag(l.,Aj^)(z*+^ — Av*) 
Compute v*+^ from 


(Compute the gradient terms} 


^ 'rp- 5 i(r*,rp‘), 

rl+i ^ Sri+i 


.t-H 


^ Tp-9ziP\Tp) 


P 

-t+1 


^t+t 


^(l-rW./T*+^)./r: 
^ l./(SV}+i) 


/^t+i 

p 


(Update the linearization} 

Select a damping parameter 0* S [0,1] 
^ 6»*l./x}+i -f (1 - 0‘)l./rp* 

l./r^+i ^ 0*1 ./t^+i + (1 - 0‘)l./r^ 

until Terminated 


C. The ADMM-GAMP Algorithm 

Inserting the above ADMM updates into the outer loop al¬ 
gorithm, Algorithm]^ we obtain the so-called ADMM-GAMP 
method summarized in Algorithm]^ There and elsewhere, we 
use to denote componentwise vector-vector multiplication 
and to denote componentwise vector-vector division. Note 
that the updates for the ADMM iteration appear under the 
comment “ADMM inner iteration.” 

Although, in principle, we should perform an infinite num¬ 
ber of inner-loop iterations for each outer-loop iteration. Algo¬ 
rithm]^ is written in a more general “parallel form.” In each 
(global) iteration t, there is one ADMM update as well as 
one linearization update. However, by setting the outer-loop 
damping parameter as 0* = 0, it is possible to bypass the 
linearization update. Thus, we can obtain the desired double¬ 
loop behavior as follows: First, hold 0* = 0 for a large number 
of iterations, thus running ADMM to convergence. Then, set 
0* > 0 for a single iteration to update the linearization. Then, 
hold 0* = 0 for another large number of iterations, and so on. 
However, the parallel form of Algorithm]^ also facilitates other 
update schedules. For example, we could run a small number 
of ADMM updates for each linearization update, or we could 
run only one ADMM update per linearization update. 

An interesting question is whether the algorithm can be run 
with a constant step-size 0* = 0 for some small 0. Unfor¬ 
tunately, our theoretical analysis and numerical experiments 
consider only the double-loop implementation where several 
ADMM iterations are run for each outer loop update. 

Another point to note in reading Algorithm is that the 














expectation and variance operators in ( [3T| l, ( |41b| l, and ( |41c| ) 
have been replaced by componentwise estimation functions 
and Pj, and their scaled derivatives. In particular, recall from 
that is fully parameterized by (r‘, r*) and that is 
fully parameterized by (p*, Tp). Thus, we can write the means 
of these distributions as 


= E(x| 6 ^+^) = g^{r*, t*), (51a) 

z‘+i=E(z|6f)=5.(p‘,r^), (51b) 

as reflected in line 1^ of Algorithm!^ For separable and /^, 
we note that the computations in pi) can be performed in a 
componentwise, scalar manner, e.g.. 


-r 





_A 

9a=A^J 

,<■) 



4 a:exp(- 

-fa 

- 

^{x 

3 

-r]f 

■) da: 

/Rexp(- 

/xj 

fx) - 

2 ^(t- 

- 

1 dx 


A 
1 — 

9^, {p\ 




/K^exp(- 

-h 

M)- 


-pd^: 

)dz 

/Hexp(- 


(z) - 


pI?) 

dz ’ 


(52) 


(53) 


Furthermore, the variances of 6 ^^ and 6 *+^ can be computed 
in a componentwise manner using the derivatives of gx^ and 
gzi with respect to their first argument p 2 ), i.e.. 


= var(x| 6 ^+i) = T*.gx{r\ t*), (54a) 

T*+^ = var(z| 6 fi) = r^.g^ip^ r^), (54b) 


as 


reflected in line 15 of Algorithm]^ That is, 




dgxM,Tl^) 


drj 


(55) 


We use these general scalar estimation functions gx and g^ 
since it will allow us later to consider a similar algorithm for 
the MAP estimation problem (|^. 

Interestingly, the ADMM-GAMP algorithm has close sim¬ 
ilarities to the sum-product version of the original GAMP 
algorithm from p2), as we will discuss in Section VIII For 


example, the sum-product version of the GAMP algorithm 
uses the same estimation functions gx and g^ from ( |5T) , which 
we will refer to as the MMSE estimation functions. 


D. Computational Cost 

While we will demonstrate below that ADMM-GAMP 
offers improved convergence stability relative to the GAMP 
algorithm of | [ 22 ) , it is important to point out that the com¬ 
putational cost of ADMM-GAMP may be somewhat larger: 
One of the main attractive features of GAMP and other first 
order methods, is that each iteration requires only matrix- 
vector multiplies by by A and A^. Each such multiplication 
will have complexity 0(mn) in the most general case, and 
may be smaller for structured transforms such as filters, FFTs, 
or sparse matrices. 

In contrast, ADMM-GAMP requires a least-squares (LS) 
minimization in each iteration. Exact evaluation of the 
minimization will have a cost of 0(n?"m) - a cost not incurred 
in GAMP or most other first-order methods. As is done 


ADMM - and in the simulations below - the minimization 
can be performed approximately via conjugate gradient (CG) 
0 - Conjugate gradient also requires repeated matrix-vector 
multiplies by A and A^, but will require K such matrix-vector 
multiplies where K is the number of CG iterations. In the 
simulations below, we will use AT = 3, thus increasing the per 
iteration cost of ADMM-GAMP by a factor of approximately 
3 relative to standard GAMP. 

The other computations in each iteration of ADMM-GAMP 
are typically smaller than the LS minimization and are sim¬ 
ilar to those performed in GAMP. Eor example, similar to 
GAMP, each iteration requires evaluation of the estimation 
functions gx{‘) and gz{‘)- These can be performed as n and 
m componentwise scalar functions given in ( |5^ and ( |5^ . Eor 
certain penalty functions, such as Bernoulli-Gaussians, these 
will have closed-form expressions; otherwise, they will need 
to be evaluated via numerical integration. In either case, the 
componentwise cost does not grow with the dimension, so 
the per iteration cost of evaluating the estimation functions is 
0{m + n) and are typically not dominant for large m and n. 

VI. ADMM-GAMP FOR MAP ESTIMATION 
A. ADMM Inner Loop 

Eor the posterior density p(x|y) in (|^, the MAP estimates 
of the vector x and its transform z = Ax are given by the 
constrained optimization 

(x,z) = argmin J(x, z) s.t. z = Ax, (56) 

X.Z 

where J(x, z) is the objective function 

'^(x,z) =/a:(x)-f/^(z). (57) 

We will show that, with appropriate selection of the estimation 
functions, gx and gz, the inner loop of Algorithmcan be used 
as an ADMM method for solving 

As before, we replace the constraint z = Ax in the 
optimization ( |56) with two constraints: x = v and z = Av. 
We then define the augmented Lagrangian 

L(x,z,s,q,v;7-p,T^) 

- /x(x) + /z(z) + q'''(x - v) -f s'''(z - Av) 

+ il|x-v||2^-f i|jz-A v||2^. (58) 

The ADMM recursions ( |39) for this augmented Lagrangian 
are then given by 

(x‘+\z‘+^) = argminL(x,z,s‘,q*, v*;Tp,Tr), (59a) 

X,Z 

s‘+i = s‘ -f Diag(l./Tp)(z*+i - Av*), (59b) 

q‘+i = q* Diag(l./T^)(x‘+1 - V*), (59c) 

v‘+^ = arg min L(x*+^, z‘+^, s*+^, q*+^, v; Xp, r^.). (59d) 

V 

To perform the minimization in ( |59a| i, first consider the 
minimization over x. Eliminating terms that do not depend 
on X, we obtain 

x*+^ = argmin/: j,(x) -f q'''x -f i||x - v||^^ 

X 

= argmin/: j,(x) -f i||x-f r^.q - v||^^. (60) 
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Similarly, the minimization over z reduces to 

z‘+i = argmin/ 2 (z) + s'''z + i|lz - Av||2 


Tp.S - Av|| 


= argmin/2(z) + i|| 

Z 

Hence, if we define the estimation functions 

g^(r, Tr) = argmin [/^(x) + ||lx - r\\l^ 

X 

gziP,T-p) = argmin /^(x) + ^\\z - p\\l^ 

Z L 

then we can rewrite ( |60l l and © as 

= gx (r‘ , 7-^) , Z*+1 = g, (p* ,Tp), 


(61) 

(62a) 

(62b) 

(63) 


for r* and p‘ dehned in ( |48] l. Also, the minimization over v 
in ( |59d| i can again be cast as the least-squares problem ( |49l l. 

We see that equations (|4g, (|^, <1^ and @ are 

precisely the updates in the ADMM inner-loop of Algorithm]^ 
Therefore, for fixed penalty terms Tr and Tp, the inner loop 
of the ADMM-GAMP algorithm with the estimation functions 
( |62| l is precisely an ADMM algorithm for the MAP estimation 
problem ( |5^ . 

The functions in ( [62| ) are the standard “proximal” operators 
used in many implementations of ADMM and related opti¬ 
mization algorithms These functions also appear in the 
max-sum version of GAMP from 122) , which is used for MAP 
estimation. Thus, we will refer to ( |62| l as the MAP estimation 
functions. 


is through a standard “hardening” analysis, which is also used 
to understand how max-sum loopy belief propagation can be 
viewed as a limit of sum-product loopy BP; see, for example, 
1481, p9| . Specifically, combining with Laplace’s Principle 
from large deviations |50|, and assuming suitable continuity 
conditions, the marginal minimization functions ( |64l i are given 
by (up to a constant factor) 


{Xj ) = - j,im T In {xj ; T), 


where [xj] T) and (z^; T) are the marginal densities for 
the scaled joint density 


p(x;r) = 


I exp 


-;^(/x(x) + /z(Ax)) 


Note that, for any T > 0, we can estimate the marginal posteri¬ 
ors px {xj \ T) and p^. {zi\ T) using the LSL-BFE optimization 
from Section |V] That is, we can use the estimate 

fxj (xj) ^ fxj (xj) = - j^im T In {xj;T), (66a) 

i'ziizi) « fzi{zi) = - lim Tln5^.(zi;T), (66b) 


where bxj{xj;T) and bz.{zi]T) are the belief estimates com¬ 
puted via the LSL-BFE optimization under the scaled penalties 

/,(x;T)4/,(x)/T, /,(z;T)4/,(z)/T. (67) 


B. Hardening Limit of the LSL-BFE 

The above discussion shows that, with the MAP estimation 
functions ( |62| ), the ADMM-GAMP outputs (x*,z‘) can be 
interpreted as estimates of the MAP solution from ( |56l l. How 
then do we interpret the related terms In the infer¬ 

ence (i.e., MMSE) problem from Section |V] the components 
of T* and T* are estimates of the variances of the marginal 
posteriors. Below, we use a hardening argument to show that, 
in the MAP problem, can be interpreted as estimates 

of the local curvature of the MAP objective ( |57| ). 

To be precise, let us dehne the marginal minimization 
functions 

fxj (xj) = min J (x, Ax), (64a) 

x\xj 

fziizi) = min J(x,Ax), (64b) 

x:2i = [Ax]i 

where the minimizations are over the vector x, holding either 
Xj or Zi = [Ax]j fixed. Note that, if one can compute these 
marginal minimization functions, then one can compute the 
components of the MAP estimates from ( [56] l via 

Xj = argmin(/>a;^.(xj), % = B.TgmiTi(l)z.{zi). (65) 

Xj Zi 

However, the marginal minimization functions provide not 
only componentwise objectives for the MAP optimization 
( |5^ , but also the sensitivity of those objectives. 

We will see that ADMM-GAMP provides estimates of the 
marginal minimization functions, in addition to estimates of 
the MAP solution in ([56]l. Perhaps the easiest way to see this 


In statistical physics, the parameter T has the interpretation of 
temperature, and the limit T ^ 0 corresponds to a “cooling” 
of the system. In inference problems, the cooling has the effect 
of concentrating the distributions about their maxima. 

A large-deviations analysis in Appendix [B| shows that, if we 
use ADMM-GAMP with the MMSE estimation functions (|5T| 
with the scaled functions (|67]), then at iteration t the limits in 
(|66]) are given by 



= — lim Tln&* (a;,;T), 

T^O J' 

= fxj (Xj) + 2-r{xj — Tj) 

(68a) 


= — lim Tln&(, (zp-T) 



r-fO 



= fzi{Zi) ~ Pi) ; 

(68b) 


where the parameters rj, r* , and are the outputs of 
ADMM-GAMP under the MAP estimation functions ( |62] i. In 
this sense, ADMM-GAMP under the MAP estimation function 
can be seen as a limiting case of ADMM-GAMP under the 
MMSE estimation functions. Hence, according to ( |66l l, MAP 
ADMM-GAMP can be used to compute estimates ( |68l l of the 
marginal minimization functions ( |64| l. Furthermore, according 
to ( |6^ and ( |6^ , x*+^ and z*+^ are the minima of these 
functions 

= argmin^^^(a;j), z^^ = argmin^^z^), 

Xj Zi 

as one would expect from (|65l). 
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Finally, it can be shown (see ( |1 10| l) that, for the MAP 
estimation functions ( |6^ , the outputs of line[^in Algorithmj^ 
take the form 


II 

A 

Trj 

* 1 

(69a) 

Xz, = Tp-g'z:.iPt,Tp.) 

^ + xpj’f{%y 

(69b) 

Meanwhile, from (|68|). we see 

that 



1 d^fiy-z^y 

(70) 

“ 9x2 . 

rif dz^ ■ 


Therefore, when ADMM-GAMP is used for MAP estimation, 
the components of and r* can be interpreted as the inverse 
curvatures of the constrained function J(x, z = Ax) in the 
vicinity of the current estimate (x‘,z‘). 

Appendix also show that, in the limit as T —0, the 
LSL-BFE optimization ( [T4l i decomposes approximately into 
two decoupled optimizations: The first computes the MAP 
estimates (x,z) from (|5^, and the second computes 


{%,%) = argmin J^(x2;,x^,x,z), 


(71) 


where 


J2(x^,T^,X,z) = ^ - ln(T^J 

i=i 

1 


(72) 


-E 

2 = 1 


/".(2'0 + — +ln 


and, as before, Xp = Since the optimization ( |7T] l provides 
the inverse-curvature estimates in ( |70| ), we will refer to it as 
curvature optimization. 


VII. Convergence Analysis for Strictly Convex 
Penalties 

A. Fixed Points of ADMM-GAMP 

We hrst characterize the hxed points of ADMM-GAMP, 
assuming that the algorithm converges. 


B. Convergence of the ADMM Inner Loop 

For the remainder of this section, we will show the con¬ 
vergence of ADMM-GAMP in the special case of convex and 
smooth penalties f^ and We begin by analyzing the conver¬ 
gence of the ADMM inner-loop under hxed linearization terms 
Xr and Xp. It is well-known that, when one applies ADMM to 
a general optimization problem of the form ( |37] i with convex / 
and full-rank B, the method will converge Q. However, in our 
case, the objective function is the linearized LSL-BFE in ( [3T] l, 
which is not necessarily convex, even if the penalty functions 
fx and fz are. The problem is that the variances var(x|& 2 ,) and 
var(z| 6 ^) are not convex functions of the densities and 
(in fact, they are concave). We thus need a separate proof. 

We will prove convergence under the following assumption. 

Assumption 2: Lor hxed and Xp, the estimation functions 
gx{tc,Tr) and 5 z(p,Xp) are separable in r and p in that 

9x{x,Tr)= {gxAxi^Xr),--- , (r„, X^)) , 

9z{M,Tp) = {gzdPl^Xp),--- ,g^^{pm,Tp)) 

for scalar function gx^ and . In addition, these scalar func¬ 
tions have, with respect to their hrst arguments, continuous 
hrst derivatives 5 ^^ and g'^, satisfying 

e < g'xj {rj,'rr) < 1 - e, e < 5 ),. {pi, Xp) < 1 - e. (73) 

for some constant e G (0,0.5]. 

The assumption requires that the estimation functions are 
strictly increasing contractions. Importantly, the following 
lemma shows that this assumption holds when the penalty 
functions are smooth and convex. 

Lemma 1: Suppose that fx and are strictly convex, 
separable functions, in that they are of the form Q, where 
the components have continuous second derivatives such that 

A < /" (x,) < B Vx„ A < /" (z,)<B 'iz ,, (74) 

for some 0 < A < B < 00 . Then, both the MMSE estimation 
functions in •HD and the MAP estimation functions in ( | 6 ^ 
satisfy Assumptionfor any Xr,Xp > 0. 

Proof: See Appendix [P] ■ 

We now have the following convergence result. 

Theorem 4: Consider Algorithm with only ADMM up¬ 
dates (i.e., 0 * = 0 for all f), so that the linearization terms 
remain constant, i.e.. 


Theorem 2: At any fixed point of ADMM-GAMP with the 
MMSE estimation functions the belief pair {bx^bf} in 
( |47l i is a critical point of the constrained LSL-BLE optimiza¬ 
tion ( flAl i. 

Proof: See Appendix jC] ■ 

Theorem 3: At any fixed point of ADMM-GAMP with the 
MAP estimation functions ( |62| l, the output (x, z) is a critical 
point of the constrained MAP optimization ( [5^ and {tx,Tz) 
is a critical point of the optimization ( [7T] i. 

Proof: See Appendix jC] ■ 

Theorems and show that, if ADMM-GAMP converges, 
then its limit points will be local minima of either the inference 
(i.e., MMSE) or MAP problems. 


Xp = Tp and T* = Tr Vf, 

for some vectors Tp and x^. Then, if the estimation functions 
satisfy Assumption the algorithm converges to a unique 
hxed point at a linear rate of convergence of 1 — e. 

Proof: See Appendix]^ ■ 

C. Outer Loop Convergence: MMSE Case 

Theorem]^ shows that, with the MMSE estimation functions 
and strictly convex penalties, the ADMM inner loop of 
Algorithm converges. We next consider the convergence 
of the outer loop. Algorithm assuming that the inner 
minimization (i.e., linej^of Algorithmic is computed exactly. 
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Theorem 5: Suppose that the functions fx and fz satisfy 
the assumptions in Lemma and the matrix S has positive 

components (i.e., Sij = > 0 Vij). Then, there exists 

a 9 such that, if 9^ < 9, the sequence of belief estimates 
generated by Algorithm yields a monotonically non¬ 
increasing LSL-BFE, i.e.. 

Proof: See Appendix]^ ■ 

Together, Theorems and demonstrate that ADMM- 
GAMP will converge under an inhnitely slow damping sched¬ 
ule. Specifically, we select iterations ti < t2 < ■ ■ ■ that are 
infinitely far apart. Then, for all t between each tk and t^+i, we 
set 0* = 0 so that the ADMM inner-loop is run to completion, 
and at each t = tk, we select 0* to be a small positive value. 

It is of course impossible to use an inhnite number of 
inner-loop iterations in practice. Fortunately, our numerical 
experiments in Sectionjl^suggest that a hxed number of inner- 
loop iterations is sufficient. 


D. Outer Loop Convergence: MAP Case 

We can prove a stronger convergence result for ADMM- 
GAMP under the MAP estimation functions ( |6^ , if we make 
two additional assumptions. Recall from Theorem that, if 
we set 9* = 0 Vi, then the linearization parameters r* and 
will remain constant with t and the algorithm will converge 
to some hxed point. Our hrst assumption is that we begin the 
algorithm at one such hxed point. That is, we suppose that the 
time t = 0 versions of 


p , q , s , V 


(75) 


are hxed points of lines through in Algorithm Our 
second assumption is that we replace the update in line[l7| 
with 

^(l-rf\/r‘)./r‘. (76) 

That is, we use instead of Under these two additional 
assumptions, we can prove the following. 


Theorem 6: Consider ADMM-GAMP, Algorithm run 
under the MAP estimation functions ( |62| ), with penalty func¬ 
tions fx and fz satisfying the assumptions of Lemma 
Suppose that the initialization ( |75] l is a hxed point of lines 
1^ through [T^ and that line is replaced by fTh] ). Then, if 
= 1 for all t, 

(a) Even though and may change with t, the variables 
in (|75|l will remain constant. That is, for all t. 


„t _ „0 _ „0 

A. — A. ^ ^ ^ ^ 1 — 1 , ^ ^ , 

qt ^ qO^ gt ^ gO^ ^ ,^0^ 


(77) 


Moreover, the variables (x°,z°) are the global minima of 
the MAP estimation problem ( |56] l. 

(b) The linearization parameters and r* converge to unique 
global minima of the curvature optimization (in). 

Proof: See Appendix [G| ■ 


The result shows that, in principle, we can solve the MAP 
estimation problem by hrst running the ADMM inner loop to 


convergence with arbitrary positive linearization terms and 
Tp. Then, we could turn on the outer loop updates, thus driving 

and rj to the minima of the curvature optimization problem 
(in). Of course, in practice, one cannot do this perfectly, since 
the ADMM inner loop must be terminated at some hnite 
number of iterations. Also, it is possible that, by letting the 
variance terms adapt (at least slowly) before the inner loop 
fully converges, the convergence speed of the inner loop can 
be improved. In fact, this is our empirical experience, although 
we have no proof. 

It is important to point out that the MAP convergence proof 
requires a slightly modihed variance update given in S- 
This update may actually be preferable for the MMSE case 
as well, however, further analysis would be required. Indeed, 
while we have demonstrated one variance update with provable 
convergence, hnding the best variance update method is a still 
an open question. 

VIII. Relationship of ADMM-GAMP to GAMP 

There are two key differences between the proposed 
ADMM-GAMP algorithm and the original sum-product 
GAMP algorithm from p2) , reproduced for convenience in 
Algorithm (with the variance updates indented for visual 
clarity). 

1) The ADMM-GAMP algorithm uses two additional vari¬ 
ables: a dual variable q‘, and an auxiliary variable v* that 
is updated via the least-squares optimization ( |49l l, that are 
not present in the original GAMP algorithm. 

2) ADMM-GAMP uses an alternating schedule of mean and 
(possibly damped) variance updates, whereas GAMP uses 
interleaved mean and variance updates. 

Below, we describe these differences in more detail. 


Algorithm 4 Original GAMP 
Require: Matrix A and estimation functions px and 
I: S A.A (componentwise square) 

2: Initialize x°, ri? 

3: s° ^ 0 
4: f ^ 0 
5: repeat 
6: rl ^ St* 

1 : p‘ ^ Ax* - Tp.S* 1 

8 : T*^T*.g',{p*,T*) 

9: Z*^gzip*,T*) 

10 : ^ {l-Tl.lT*).fr* 

II: S* ^ [Z* -P*)./t* 

12: T* ^ l./(SV*) 

13: r* 3— X* -f Diag(r,()A^s* 

14: T*+^ ^ T*.g'x{r*,T*) 

15: ^ gx{r*,Tf) 

16: until Terminated 


A. Sum-product GAMP via Stale, Linearized ADMM 

One way to understand the differences between ADMM- 
GAMP and the original GAMP is as follows: ADMM-GAMP 
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results from minimizing the linearized LSL-BFE via ADMM 
under the splitting rule “E(z| 62 ) = Av and E(x| 6 a;) = v” (as 
described in Section V-B| i, whereas the original GAMP uses 
stale, linearized ADMM under the conventional splitting rule 
“E(z| 62 ) = AE(x| 62 :)-” Both use the same iterative LSL-BFE 
linearization strategy described in Section 1V-D| 

We can derive the mean updates in the original GAMP using 
the augmented Lagrangian 

L{b^,b^,s;Tp) = J{bj;,bz,Tr,Tp) -f s'''(E(z| 62 ) - AE(x| 62 :)) 

+ i|lE(z|62)-AE(x|6,)||2^, (78) 

for the J defined in ( [3T| l and stale, linearized ADMM: 
bT = argminL(6^,h‘,s*"^Tp) + i(E(x| 63 ;) - E(x|&J,))'^ 

X - A^D^^ A) (E(x|6,) - E(x|6^)), (79a) 

= argminL( 6 ^+\ 62 , s‘; Xp), (79b) 


= s‘ 


.D^^(E(z|6«)-AE(x|e')), 


(79c) 


where D^- = Diag(l./T). Note the addition of a “lin¬ 
earization” term in ( |79a| l to decouple the minimization. The 
resulting approach goes by several names: linearized ADMM 
0 Sec. 4.4.2], split inexact Uzawa and primal-dual 
hybrid gradient (PDHG ) p0| . Note also the use of the “stale” 
dual estimate in ( |79a| i, as opposed to the most recent 
dual estimate s*. In the context of PDHG, this stale update 
is known as Arrow-Hurwicz m- In Appendix we show 
that the recursion ( |79l l yields the mean updates in the original 
sum-product GAMP algorithm (i.e., the non-indented lines in 
Algorithm]^. 

Regarding the variance updates of the original sum-product 
GAMP algorithm (i.e., the indented lines in Algorithm |^, 
a visual inspection shows that they match the non-damped 
ADMM-GAMP “gradient” updates (i.e., lines T5p8 of Algo¬ 
rithm [funder 9* = 1 ), except for one small difference: in the 
original sum-product GAMP, the update of Tg uses the same 
version of Tp used by the update, whereas in ADMM- 
GAMP, the update of Tg uses a more recent version of Tp. 


B. Recovering GAMP from ADMM-GAMP 

We now show that the mean-updates of the original sum- 
product GAMP can be recovered by approximating the mean- 
updates of ADMM-GAMP. For simplicity, we suppress the t 
index on the variance terms. 

At any critical point of Algorithm ^ we must have q‘ = 
—A^s‘ and z‘ = Ax‘, as shown in ^l07| ). If we substitute 
these two constraints into the v-update objective in ( |4^ , we 
obtain 

||z‘-fx,.s‘-Av||2^ + ||x‘ + r,.q‘-v||?.^ 

= l|A(x* - v) -f Tp.s^ll^ + ||x‘ - V - Diag(x^)AV||2^. 

It can be verified that the minimum for this function occurs 
at V = X*. So, if we substitute v* = x* and q* = —A^s* into 

'See, e.g., Sec. 3.1]. 


the mean updates in Algorithm we obtain 

= 9x{r\Tr), 

= 9zip\Tp), 

s‘+i = s‘ + Diag(l./Tp)(z*+1 - Ax*), 
rt+i ^ ^t+i Diag(rr)A'''s*+\ 

P*+1 = Ax*+1 - Tp.S*+\ 

Then, substituting the p update into the s update, dehning 
z* = z*+^ and s* = and reordering the steps, we obtain 

P* = Ax* — TpS*“^, 

= 9zip\rp), 

s* = Diag(l./Tp)(z* - p*), 
r* = X* -f Diag(Tr)A'^s*-\ 

= 9x{r\Tr), 

which is precisely the GAMP mean-update loop. 


IX. Numerical Experiments 

We now illustrate the performance of ADMM-GAMP by 
considering three numerical experiments. While our theoretical 
results assumed strictly convex penalties, we numerically 
demonstrate the stability of ADMM-GAMP for the non- 
convex penalty corresponding to a Bernoulli-Gaussian prior 
on X, i.e.. 


Pxix) = (1 - p)S{x) -f pAf{x; 0,1), 


(80) 


where p G (0,1] is the sparsity ratio and 6 is the Dirac 
delta distribution. In our experiments, we fix the parameters 
to n = 1000 and p = 0 . 2 , and we numerically compare the 
normalized MSE 


x-x 


NMSE (dB) = lOlogio 

of ADMM-GAMP to four other recovery schemes: the original 
GAMP method |22|; de-biased LASSO ||52|; swept AMP 


(SwAMP) p7) ; and the support-aware MMSE estimator, la¬ 
beled “genie.” The SwAMP method is identical to original 
GAMP method but updates only one component of x at a 
time - a common technique also used for stabilizing loopy 
BP. For LASSO, we optimized the regularization parameter 
A for best MSE performance. For GAMP, SwAMP, and 


x‘-M |2 


?t-l| 


x "^"||2 < 10 “^ and imposed an upper limit of 200 
iterations. In all experiments below, ADMM-GAMP was run 
with 10 iterations of the inner loop ADMM minimization for 
each outer loop update. Also, the least-sqaures minimization 
( |49| ) was performed with 3 conjugate gradient iterations per 
inner loop iteration, using as a warm start, the output of final 
value from the previous iteration as the initial condition of the 
current iteration. 

In our first experiment, we consider a standard problem: 
recover sparse x from y = Ax -f e, where e is AWGN 
with variance set to achieve an SNR of 30 dB, and where the 
measurement matrix A is drawn with i.i.d. A/'(0, 1/m) entries. 
Figure shows the NMSE performance of the algorithms 
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Measurement ratio {m/n) 



Fig. 2. Average NMSE versus measurement rate m/n when recovering 
a length n = 1000 Bernoulli-Gaussian signal x from AWGN-corrupted 
measurements y = Ax + e under i.i.d. A. 


under test after averaging the results of 100 Monte Carlo trials. 
Here, since y and z = Ax are related through AWGN, the 
GAMP algorithm of reduces to the Bayesian version of 
the AMP algorithm from p8| . 

Note that the case of i.i.d. A is the “ideal” scenario for 
both AMP and GAMP. As discussed in the Introduction, their 
convergence in this case is guaranteed rigorously through state 
evolution analysis 122 |-p4) as m,n —i oo. In Figure]^ since 
m and n are sufficiently large, it is not surprising to see 
that GAMP performs well over all measurement ratios m/n. 
Furthermore, it is interesting to notice that GAMP outperforms 
LASSO and obtains NMSEs that are very close to that of 
the support-aware genie. Under such ideal A, the proposed 
ADMM-GAMP method matches the performance of GAMP 
(since it minimizes the same objective) but does not offer any 
additional benefit. 

The benefits of ADMM-GAMP become apparent in our 
second experiment, which uses non-i.i.d. matrices A. In de¬ 
scribing the experiment, we first recall that p5j established 
that the convergence of GAMP can be predicted by the peak- 
to-average ratio of the squared singular values, 

a?(A) 


n{A)^ 


ELi^?(A)/r’ 


( 81 ) 


where r = mm{m,n} and (7i{A) is the i-th largest sin¬ 
gular value of A. When this ratio k is sufficiently large, 
the algorithm will diverge. Thus, to test the robustness of 
ADMM-GAMP, we constructed a sequence of matrices A 
with varying k, as follows. First, the left and right singular 
vectors of A were generated by drawing an m x n matrix 
with i.i.d. Af{0,l/m) entries and taking its singular-value 
decomposition. Then, the singular values of A were chosen by 
setting the largest at cri(A) = 1 and logarithmically spacing 
each successive singular value to attain the desired peak-to- 
average ratio k. 

As a function of k, the NMSE performance of the various 
algorithms under test is illustrated in Eigurej^for the case of 
TO = 600 measurements. There it can be seen that, for larger 
values of k, the NMSE performance of the original GAMP 
algorithm deteriorated, which was a result of the algorithm 


Fig. 3. Average NMSE versus peak-to-average squared-singular-value ratio 
k(A) when recovering a length n = 1000 Bernoulli-Gaussian signal x 
from m = 600 AWGN-corrupted measurements y = Ax -|- e. Note the 
superior performance of ADMM-GAMP relative to both the original GAMP 
and SwAMP, and the proximity of ADMM-GAMP to the support-aware genie. 



Fig. 4. Average NMSE versus peak-to-average squared-singular-value ratio 
k(A) when recovering a length n = 1000 Bernoulli-Gaussian signal x from 
m = 2000 noiseless 1-bit measurements y = sgn(Ax). Note the superior 
performance of ADMM-GAMP relative to the original GAMP and SwAMP. 


diverging. (Note that, in the plot, we capped the maximum 
NMSE to 0 dB for visual clarity.) The figure also shows 
that the SwAMP method achieved low NMSE over a wider 
range of k ratios than the original GAMP method, but its 
performance also degraded for larger values of k. The ADMM- 
GAMP method, however, converged over the entire range of 
K values, achieving NMSE performance relatively close to the 
support-aware genie. 

In our third and final experiment, we recover x from “one- 
bit” measurements y = sgn(Ax), where sgn is the sign 
function, as considered in, e.g., | |5^ and pOj . Here, we used 
TO = 2000 measurements and generated the matrices A as 
in our second experiment. Eigure shows the NMSE perfor¬ 
mance of the various algorithms under test. The results in the 
figure illustrate that the original GAMP method diverged for 
K >2. However, both SwAMP and ADMM-GAMP recovered 
the solution for the whole range of k without diverging, with 
ADMM-GAMP yielding slightly better NMSE (about 0.3 dB 
better) at higher values of k. 
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Conclusions 

Despite many promising results of AMP methods, the major 
stumbling block to more widespread use is their convergence 
and numerical stability. Although AMP techniques admit 
provable guarantees for i.i.d. A, they can easily diverge for 
transforms that occur in many practical problems. While sev¬ 
eral methods have been proposed to improve the convergence, 
this paper provides a method with provable guarantees under 
arbitrary transforms. The method leverages well-established 
concepts of double-loop methods in belief propagation as 
well as the classic ADMM method in optimization 0. 

Nevertheless, there is still much work to be done. Most 
obviously, the proposed ADMM-GAMP method comes at a 
computational cost. Each iteration requires solving a (poten¬ 
tially large) least squares problem ( |49] l that is not needed in 
the original AMP and GAMP algorithms. Similar to stan¬ 
dard applications of ADMM, this minimization can likely 
be performed via conjugate gradient iterations, but its imple¬ 
mentation requires further study. In any case, it is possible 
that ADMM-GAMP will be slower than other variants of 
GAMP. Indeed, our simulations suggest that other methods 
such as SwAMP or adaptively damped GAMP | |28) may 
provide equally robust performance with less cost per iteration. 
One line of future work would thus be see to whether the 
proof techniques in this paper can be extended to address these 
algorithms as well. 

The analysis in this paper might also be extended to other 
variants of AMP and GAMP. For example, it is conceivable 
that similar analysis could be applied to develop convergent 
approaches to the expectation-maximization (EM) GAMP de¬ 
veloped in pT) , ||54|-||57), turbo and hybrid GAMP methods 
in | [58) , | [59| and applications in dictionary learning and matrix 
factorization 


Appendix A 
Proof of Theorem[T] 

Throughout this appendix, we use the shorthand notation 
for the gradient = dh{T)/dT S K^. 

First we show, by induction, that 7 ^ G E for all k. Recall 
that, by the hypothesis of the theorem, 7 ° G E. Now suppose 
that 7 ^ G E. Then the updates in Algorithm imply that 

h'{T^) = h'(g(b'=)) = h'(g(b(7'=))). 

Then, by Assumption [^c), h'{ t^) G E. Since 7 ^ G E, 6 *^ G 
(0,1], and E is convex, 

7*4-1 ^ ^ g Y. 

Thus, by induction, 7 ^ G E for all k. 

Next, we prove the decrementing property ( |Z7| l. First ob¬ 
serve that since the restriction b G i? is a linear constraint, 
we can find a linear transform B and vector bo such that 
b G -B if and only if b = Bx + bg for some vector x. 
It can be verihed that we can reparameterize the functions 
/(•) and g(-) around x and obtain the exact same recursions 
in Algorithm [T] Also, all the conditions in Assumptions [T] 
will hold for reparametrized functions as well. Thus, for the 


remainder of the proof we can ignore the linear constraints B, 
or alternatively view B as the entire vector space. 

Under this assumption, for any 7 , and any minimizer b( 7 ) 
will be in the interior of B and therefore, 

0 = = /(b(7)) + ^ge(h{j))'yi (82) 


ab 

= /'(b(7)) +5'(b(7))7, 


(83) 


where /'(b) is shorthand notation for the gradient df{h)/dh, 
g'i(h) is shorthand for the gradient (with respect to b) of 
the fth component of the vector-valued function g( ), and 
where ^'(b) = [pj(b),... , 5 /(b)] is matrix-valued. Taking 
the gradient of (|82l) with respect to 7 ^ yields the matrix 


(a) a/'(b(7)) 

97^ 


( 6 ) 


= H(7) 


abT 

ab(7) 


E 

L 

E 


dg'Mi)) 


57 ^ 

9/'(b(7)) , dg[{h{^)) 

t=i 


7^ +5'(b(7)) 
5b ( 7 ) , 


57 ^ 


abT 
ff'(b(7)), 


7r 




ff'(b(7)) 


(84) 


where (a) and (b) follow from the chain rule and H( 7 ) is the 
Hessian from (|25|l. Equation (|84li then implies 


5b(7) 

^7^ 


= -H(7)-^g'(b(7)) 


(85) 


where Assumption[2b) guarantees the existence of the inverse. 
The gradient of the objective with respect to 7 ^ is then 


(“) 


^7^ 


-|T 


5b(7) 


/'(b(7)) -f g'(b(7))/i'(x) 

S(4(r)-7)V(b(7)f^ 

= (7-4(7)}^5'(b(7))^H(7)-V'(b(7)), (86) 


where (a) follows from ( [ 22 | ) and the chain rule, (b) follows 
from ( |^ , and (c) follows from ( |85l l. 

Notice that the 7 *^ update in Algorithm can be written as 

= (h'(T'=) -7'=)0'=. 

Taking an inner product of the above and ( | 86 l l evaluated at 
7 = 7 ^, we get 


97 ^ 


( 7^+1 _ 7 /c) 


= -(7'= - h'(x'=))V(b")^H(7'=)-i5'(b'=)(7'' - h'{T^))e 
<--|k'(b'=)(7'^-/i'(r''))f, 


C2 


(87) 


recalling that b( 7 ^) = b*^ and that C 2 was defined in 
Assumption [^b). Therefore, the update of 7 *^ is in a descent 
direction on the objective J(b( 7 )). Hence, for a sufficiently 
small damping parameter 6^, we will have 

J(b'=+^) - J(b'=) = J(b(7'=+i)) - 7(6(7'=)) < 0, 
which proves the decrementing property ([ZT]). 
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Appendix B 

Large Deviations View oe MAP Estimation 

For each T > 0, let x‘(T), z‘(T),..be the output of 
the ADMM-GAMP algorithm with the MMSE estimation 
functions a and the scaled penalties (|67]i. Next, we define 
several limits. For the mean vectors we define 


X* = lim x‘(T), z‘ = lim z‘(T), 
T-s-O T-fO 


for the dual vectors we define 


q* = lim rq*(T), s‘ = lim Ts\T), 

T-fO T-fO 


and for the variance terms we define 

T-s-O T T-fO T 

<{T) 


t^(T) 

ri = lim '' , tI = lim '^ , t* = lim —, (88a) 


T-)-0 T 


r' = lim 


T-fO T 




= lim Tr*(T). 
T-i-O 


(88b) 


We will assume that all of these limits exist. Note that some of 
terms are scaled by T and others by 1 /T. These normalizations 
are important. It is easily checked that the scalings all cancel, 
so that the limiting values satisfy the recursions of Algorithmj^ 
with the limiting estimation functions 


gx{r,rr) = lim p^(r, Tr(T); T) = \im g^{r,TrT;T), (89a) 

T-i-O T—>0 

gz{p,Tp) = Urn a^(p,Tp(T);r) = lim ^^(p,T-pT;T),(89b) 


where 5 a;(r, Tr-T; T) and gz{p,TpT;T) are the MMSE estima¬ 
tion functions for the scaled penalties (|67]l. Note that we 
have used the scalings in ( |88] l, which show tv(T) « Tt^ 
and Tpir) « TpT for small T. Now, the scaled function 
gx{r,TrT;T) is the expectation E(x|T) with respect to the 
density 


p{'K\r,TrT-,T) cx exp 


fxjx) 

T 




Laplace’s Principle | |50) from large deviations theory shows 
that (under mild conditions) this density concentrates around 
its maxima, and thus the expectation with respect to this 
density converges to the minimum 


lim gx{r,TrT]T) = argmin/a;(x) + i||x- rH^^, 
r—vo X 


which is exactly the minimization in the MAP estimation 
function ( [62| ). The limit of gz{p, Tp/T]T) as T —0 is similar. 
We conclude that the limit of the ADMM-GAMP algorithm 
with MMSE estimation functions 0 and scaled densities 
( |67| l is exactly the ADMM-GAMP algorithm with the MAP 
estimation functions ( |62| l. In particular, for each T, the density 
over Xj in •EH is given by 


bl,{xj\rj,Tr^T) cx exp 


fxj i^j) 
T 


(x,-r*)2 

2Tr*^ 


(90) 


from which we can prove the limits in ( |68l l. 

It remains to show that the LSL-BFE in ( [T6] l with the scaled 
functions ([67li decomposes into the optimizations (|56]) and (fTT]) 


as T ^ 0. To this end, let J{bx, b^; T) be the LSL-BFE @ 
for the scaled penalties which is given by 

J{bx,bz-,T) = D{bx\\Z-^e-f^/^) + D{bz\\Z-^e-f^/^) 

+ H{vaT{x\bx),vaT{z\bz,)) 

= ^[E{fxiK)\bx)+E{fziz)\bz) 

-f H{vaT{x\bx), var(z|&2)) 

- H{bx) - H{b,,) + const, (91) 


where H{a) denotes the differential entropy of distribution 
a, H{tx,Tz) is the entropy bound from ([T7), Zx,t = 
f e ZzT — /e and the “const” in 

is with respect to b^ and b^. Now, we know that, as 
T —0, the optimal densities bx and b^ will concentrate 
around their maxima with variance 0{T). Thus, we can take 
a quadratic approximation around the maximum 

In bx,j {xj ) « ^ 3 ) (92) 

Txi 


where 

Xj = arg min - In 6^;^ {xj ), 

Xj 


1 d^lnbx^ixj) 

dx^ 

j 


with a similar approximation for In bz.(zi). Under these 
approximations, bxj(xj) and bz-{zi) become approximately 
Gaussian, i.e.. 


Kj{.Xj) « M{Xj,TTxj), 


bzj{Zi) « M{Zi,TTzj). 


(93) 


Using these Gaussian approximations, we can compute the 
expectations 




k=0 


hx-x,mx-.x,.Trx.}1x 

k^O ■ 

00 ^( 2 /) (^ 

= E con? 


z=o 


(20! 


~ 2H\ ’ 


/=o 


(94) 

(95) 

(96) 

(97) 


where (a) wrote fx (x) using a Taylor series about x = %; 
(b) assumed the exchange of limit and integral; (c) used the 
expression for the Gaussian central moments, which involves 
the double factorial (2Z —1)!! = (2^ —1)(2Z—3)(2/—5) x • • • x 1; 
and (d) used the identity (2Z — 1)!! = Thus, for small T, 
we have 


E{fx^{Xj)\bx^) 

» fx,{Xj) + ^TXxJx^iXj), 

(98a) 

e(/..(a)|6z.) 

W fzi{%) + ^TTzJ',[iZi). 

(98b) 


The differential entropies of these Gaussians ( |9^ are 

H{bxj) = l-ln{2TTeTTxj), H{bzJ = l-ln{2TTeTTzJ, (99) 
2 2 





















16 


and the entropy term ([TtIi i 


IS 


H{\aT{x\b^),\aT{z\b^)) = H{Tt^,Tt^) 


1 


— + ln(27rTrpJ 


1—1 ^ 


, Tp ^ Sx,. (100) 


Substituting ( |98l l, ( |99l l and ( |100| l into ( |9T] l, we obtain 
Jribx^h) = ^J(x,z) + X, z) + const, (101) 


where J(-) and J^(-) are given in ( |5^ and ( |72| l. As T —^ 0, the 
first term in ( |101| l dominates, implying that the optimization 
of (x, z) can be conducted independently of Tx, Tz, as in (|5^. 
The subsequent optimization of (x^, x^) then follows, as given 
in (|^. 


Appendix C 

Proof of Theorems[2]and[3] 

We will just prove Theorem]^ since the proof of Theorem!^ 
is very similar. For the original constrained optimization ([T^l7 
dehne the Lagrangian 

LoihxX, s) = J{hxX) + syE(z|6,) - AE(x|5,)). (102) 


We need to show that any hxed points {bx, bz, s) of ADMM- 
GAMP are critical points of this Lagrangian. 

First observe that, any hxed point, Tr from line 22 of 
Algorithm satishes 


l./(2x,) = l./(2x,) = (103) 

where the last step follows from the construction of x^ in ( [32l i. 
Similarly, at any hxed point of line 

l./(2xp) = l./(2xp) = (104) 

OTz 

From ( |41b| l and ( |41c| l, we see that any hxed point satishes 

E{z\bz) = Av, E(x|&,r) = V. (105) 

Thus, the constraint in Cl is satished, in that E(z|6z) = 
AE(x|h 2 ,). Furthermore, since v minimizes ( |49l l, we know 
that it zeros the gradient of the corresponding cost function: 

0 = A'''D.rp(E(z|6^)-Av+Xp.s)+D.r,(E(x|&2.)-v+Xr.q), 

(106) 

where Dt- = Diag(l./x). Plugging ( |105| ) into the previous 
expression, we obtain 


q = —A^s. 


(107) 


Since bx minimizes the augmented Lagrangian in (|41a[), it 


zeros the corresponding gradient, i.e., 
d 

0 = ^L(&x,6z,s,q,v;Xr,Xp) 

= ^ J{bx,bz,Tr,Tp) + q^E{K\bx) 

+ ^l|]E(x|6,r) - v||2 


dbx 

dbx 
(J jd_ 
dbx 


J{bx,bz,Tr.,Tp) - q^E(x| 62 ,) 
J { bx , bz , Tr , Tp ) - s'''AE(x|6:j,) 

J{bx,bz) - H{vsi{yi\bx),Tz) 


(g _d 
db . 


+ (l./( 2 xr))'''var(x| 62 ,) - s'''AE(x|6a;) 
J{bx,bz)-s^AE{^\bx) 


^ T lu U \ 


(108) 


where (a) follows from substituting ( |40| and eliminating terms 
that do not depend on bx, since their gradient equals zero; (b) 
follows from ( |105| l; (c) follows from Col; (d) follows from 
the dehnitions of the original and linearized LSL-BFEs in ( [T6| ) 
and d); (e) follows from the chain rule and the gradient 
in ( |103| l; and (f) follows from ( |102| l. A similar computation 
shows that 

^Loibx,bz,s) = 0. (109) 

dbz 

Together, ( |108| l and ( |109| ) show that {bx,bz) are critical points 
of the Lagrangian LQ{bx, bz,s) for the dual parameters s. Since 
these densities also satisfy the constraint E(z|5^) = AE(x|6a;), 
we conclude that {bx,bz) are critical points of the constrained 
optimization ([Tl. 


Appendix D 
Proof of Lemma[T] 

For the MAP estimation functions (|62ll, we know that 


i = 9x, {rj , Tr^) = arg min fx, ^ [xj - ) 


2x,. 




which implies that Xj = Xj is a solution to 0 = fx (xj) + 
{xj — rj)lTrj, i.e., that 

=^3 -TrJ'x.iXj)- 

Taking the derivative with respect to Xj, we hnd 
^-1-T fix] — 
which can be rearranged to form 


1 


■ 

^"^3 f ( \ 

Dr, “’"‘'■’’’■"’■l + eftK, 


( 110 ) 


Then, given the assumption in the lemma, ( |1 10| ) implies that 

1 / / N 1 

< 9xArj,TrA < 


1 + BTx 


1 + ATx 
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A similar bound can be obtained for {pi^ which proves 
<0 for any fixed and Tp. 

The proof for MMSE estimation functions uses a clas¬ 
sic result of log-concave functions |j^. Since the functions 
fx and fz are separable, so are the estimation functions gx 
and gz (|5T]i, as established in ( |52| ). In particular, we can write 

9x^{rj,Tr.) =E{Xj\rj,Tr.), gzi{Pi,Tp.) = E(Zi |pi, Tp, ) , 


where the expectations are with respect to the densities 


p{xj\rj,Tr^) cx exp 


p{zi\px,Tp.) cx exp 


fxj {Xj) 

-fziizi) - 


‘^Tr- 

{Zj - PiY 


(111a) 

(111b) 


We then need to show that the condition is satisfied for 
each of the functions gx- and . Below, we prove this for 
gx^, noting that the proof for is similar. 

From ( [55| l, we know that the derivative of gx^ {xj, Tr ^) with 
respect to Xj is given by 


9x, (rj , J = ^ , Tx^ = varixj \xj,Tr^). (112) 

The variance here is with respect to the density ( |1 1 la| i, which 
can be rewritten as 


p{Xj\Xj,Tr-) = exp[-h{xj)] 

for the potential function 


Hxj) = fx,{Xj) + 


which has second derivative 


2Tr- 


= fxAXj) + — . 


By assumption ( |74l l, this derivative is bounded 

A H-< h"{xj) < B -\ -. 

Tj-- Trz 


as 


In particular, h{xj) is strictly convex. From ( |112[ t and |63 
Theorem 4.1], we have that 

\W:{Xj\xpTrA 


9'xArj,TrA = 



< —IE 1 / . I ' 

Tr- \h''{x) J ATx.+I 

It is also shown in equation (4.13) of |j^ that 

Tx, V3Ll{Xj\Xj,TrA 


(113) 


9xArj,Tr,) = 


> 


1 


E{h"{xj)) 


> 


JjJ'rj 


BXr + 1 


(114) 


Thus, we conclude that 
1 


1 + BTx 

which proves (|73ll. 


< 9xArj,TrA < 


1 


1 + Atj. - ’ 


Appendix E 
Proof of Theorem|4] 

We find it easier to analyze the algorithm after the variables 
are combined and scaled as 


A 

T = 


D ^ Diag(l./T), (115) 


and 


W 4 D 1/2 


Also, we define 


u4d-i/" 


B = 


5(w,t) = 


9x{y^,Tr) 

9z{z,Tp) 


(116) 

(117) 


and henceforth suppress the dependence on t in the notation 
since r is constant in this analysis. The mean update steps in 


Algorithm then become 

w‘+i = - u*)) (118a) 

u*+i = u‘ -f - BV* (118b) 

= argmin -f — Bv|p, (118c) 

V 

where the result of ( |118c| l can be written explicitly as 

V* = (B’^B)-ib'^(w*- f u‘). (119) 


Let us define 

P = B(B^B) ^B^, P^ = I-P, (120) 


where P is an orthogonal projector operator onto the column 
space of B and P^ is the projection onto its orthogonal 
complement. Noting that Bv* = P(w* -f u‘), ( |118a[ ) reduces 
to 


w*+i = (d-i/2(P(w‘ -f u‘) - u‘)) 

= T>^/^g (d-i/2(Pw‘ - P-^u*)) 

= p(Pw* - P^u‘), (121) 

where 

g(w) ^ D1/2 ^(D^i/2w). (122) 

Also, since P-*-B = 0, ( |118b| l implies that 
P^u‘+1 = P-^u* -f P-^w'+i 

= P^u*-l-P^ 5 (Pw*-P^u‘). (123) 


Now define the state vector 




Pw* 

P^u‘ 


(124) 


Since P^ = P and (P-L)2 = pJ-, 

[P -P^] = Pw‘ - P-^u*. 

Therefore, from ( |121| i and ( |123| l, respectively, we have that 
Pw‘+i = Pp( [P-P-L] 6>*), (125) 

P-Lu*+1 = P-Lu*-f P-L5( [P-P-L] 0*). (126) 
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From ( |124[ ), P25| l, and ( |126| l, we see that the mean update 
steps in Algorithm are characterized by the recursive system 


For the lower bound, observe that 


0^+1 = f{9*) 


(127) 


3(9) ^=^uTp'(w)U- 


0 0 
0 


for 


(6) T 

> eU^U - 


0 0 
0 


/( 0 ) = 


p 


C 

c 

(c) 

eP 0 

p^ 

5([P-P-L]0) + 

0 P^ 

9. (128) 

_ 0 (e- 1)P^_ 


The following is a standard contraction mapping result 
if / has a continuous Jacobian /' whose spectral norm is less 
than one, i.e., 3e > 0 s.t. Il/'(e)|l < 1 — e V0, then the system 
( |127[ ) converges to a unique fixed point, 9*, with a linear 
convergence rate, i.e., 

30 0 s.t. ||6>‘-6>*|| 

So, our proof will be complete if we can show that the 
Jacobian of / from ( |128| l is indeed a contraction. 

First observe that, from the definition of (?(w) in ( |117| i, 
and the separability and boundedness assumptions in Assump¬ 
tion]^ the Jacobian of (/(w) at any w is diagonal and bounded: 

3e e (0,0.5] s.t. 5 '(w) = Diag(d) and e < dfe < 1 — e \/k. 

Since D = Diag(l./T) is also diagonal, the Jacobian of p(w) 
in \\22) is given by 


p'(w) = D-1/2 Diag(d)D^/2 ^ Diag(d), 


and hence 


.r{o) = 

Hence, if we define 


P 

P^ 


p'(w)[P-P^] 


0 0 
0 P^ 


(d) (e-l)I 0 

0 (e-l)I 

> (£-1)1, 


I - eP^ 

0 


0 

(l-e)P 


(136) 


el < 5 '(w) < (1 - e)I (129) 

for all w. Now, the Jacobian of f{9) in (|128|) is given by 


where step (a) follows from ( |132| ); (b) follows from ( |129| l; 
(c) follows from the definition of U in ( |133[ l and the fact 
that P and P-'* are orthogonal projections; (d) follows from 
the definition of P-'- in ( |120| i; and ( |136[ ) follows because the 
eigenvalues of P-'- and P are in the interval [0,1] and because 
e € (0,0.5]. Together ( |135] l and P36| l show that 

||/'(e)|| = ||J(0)||<l-e. 

Hence the f'(9) is a contraction and the ADMM-GAMP 
algorithm converges linearly at rate 1 — e. 

Appendix F 
Proof of Theorem|5] 

We need to prove that the conditions of Assumption [T] are 
satisfied. Property (a) is satisfied since Theorem shows that 
the constrained linearized LSL-BFE optimization ( |3^ has a 
unique minima for any (r^, Tp) > 0. 

We next construct the set F. From the proof of Lemma 
we know that when = var(x|r, r^) and Tz = var(z|p,Tp), 


(130) 


"Fx G 


Tz G 


At^ 

A + 7 


A- 


BTr 

B + Tr 

BTp 

B + 


PJ 


Hence 


J(0) 4 f{9) 


then ||,/'(0)|| = ||J(0)|| so f'{9) is a contraction if and 
only if J(0) is. Therefore, it suffices to prove that J(6) is 
a contraction. Combining (|130|) and (|131|l, we obtain 


(137) 

(138) 

(139) 

Tp /i-l-T-pJ 

Now consider a set F of the form 

r={(x^,Tp) I Tr&[ar,br], TpG[ap,6p]}. (140) 


[1 0 1 





0 -I 

, (131) 

1 

1 



V 'Tp/ "fp 

B + Tp 

A-\- Tp_ 


J(0) = UT5'(w)U- 


0 0 
0 P^ ’ 


(132) 


where w = [ P P-'- ] 9, and 

U=[PP-L]. (133) 

Since P is an orthogonal projection and P-*^ is the projection 
onto the orthogonal complement, U is an isometry. That is. 


UU"^ = P + P-^ = I, (134) 


and hence U^U < I. Therefore, from ( |132| i and ( |129| ), 

J(6>) < uV(w)U < (1 - ejU'^U < (1 - e)I. (135) 


In order that F satisfies Assumption [^c), we need to find 
bounds ar,br,ap, bp, such that if (r,., Tp) G F, then (Tr,Tp) G 
F where (Tr,Tp) are given in ( [T5| ). 

To this end, first observe that shows that < 1/B, 

so Tp = St^ < bp for some bp. If Tp < bp, ( |139] l shows that 
Ts G \\/B,\/{A + bp)]. Therefore, using the boundedness 
assumptions on S, = 1./(S^Ts) G [ar,br] for some 
lower and upper bounds and br. Finally, if > a^, 
Tx > Aarl{A + Or) and hence Tp — Sr^ > Op for some 
Op. We conclude that we can find bounds ar,br,ap,bp, such 
that if {Tr,Tp) G F, then (Tr,Tp) G F, and F is a compact, 
convex set satisfying Assumption [TJc). 

Finally, we need to show the convexity assumption in 
As sumption [TJb). The linearized LSL-BFE in ( [3T] i is separable. 
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SO we only need to consider the convexity of one of the terms. 
To this consider a prototypical term of the form 

J{b) = D{b\\e~^^) + - —var(a;|&), (141) 

where b{x) is some density over a scalar variable x and 
fx{x) is a convex penalty function. The Hessian of J{b) is a 
quadratic form that takes perturbations vi{x) and V 2 {x) to the 
density h{x) and returns a scalar. We will denote this Hessian 
by J”{b){vi,V 2 )- Differentiating ( |141| i we obtain that 

J"{b){v,v) = J dx—^ w(a;)a;da;^ . (142) 

We need to show that this is positive. For any v{x), let u{x) = 
v{x)/b{x) so that v{x) = u{x)b[x). Since a perturbation to 
the density b{x) must satisfy J v(x) dx = 0, we have that 

E(M(a:)|6) = J u{x)b{x) dx = 0. 


Also, J”{b){v,v) above can be written as 


J''(b){v,v) E('u(x)^|6) 

= var(u(a;)|&) 


—E(M(a;)a;|6) 

Tr 

—E^(u(x)(x 


(c) 

> var(u(a;)|&) 



bx)\b) 

(143) 


where (a) follows from substituting v{x) = u{x)b{x) into 
( |142[ ); in (b) we have used the notation /i^; = E(a:|6) and 
the fact that E(it(a::)|6) = 0; and (c) follows from the Cauchy- 
Schwartz inequality with the notation = var(a;|&). Now, 
using ( |137| l, we see that when {Tr,Tp) G T, we have the 
lower bound. 


, Tx ^ B Qr 

1-^ > 1-> -1— 

Tr B + Ur B + ttr 


> 0 . 


We conclude that there exists an e such that 


Therefore, x = x° is the unique solution to 


0 = /'(x) + Diag(l./(2r0))(x - x°) + qO, 


which implies 

/;(x°) = -q°. (144) 

By the induction hypothesis ^T7\ , x* = x° and q* = q°. Since 
x‘+i = 5a:(r*j Tr), we have x = x*+^ is the unique solution to 

0 = /(x) + Diag(l./(2r,*))(x - r‘) (145) 

= /'(x) + Diag(l./(2r,*))(x - x^) + qO, (146) 

where we have used the fact that 


r^q* =x0 


T-* nO 

T^.q . 


From ( |144| ), x = x° is also a solution to ( |145| l. Therefore, 
x‘+i = x°. Similarly, if s* = s° and z* = z°, then = z°. 
From ( |49l l, = v°. We conclude that if ( |77| l is satisfied for 

some t, it is satisfied for t+1. So part (a) follows by induction. 

To prove part (b), we leverage the convergence result from 
|65|. Using our earlier result ( |110| i, we have that 

= Tr .g'^. (r*, r*.) = - ^, -. 

Xj rjiiXjK ji TjJ _j_ p/ (a:- )t* 

J Xj\ j ) Vj 

Rewriting this in vector form and using the updates in Algo¬ 
rithm with 0* = 1, we obtain that 

l./rfi = l./T^ + /"(x‘+i) = SV‘ + /;(x‘+i) 

= SV‘+^,, 4/"(x*+i) > 0 (147) 

where /"(x) = [/"^ (xiand where ^x is 
positive due to the convexity assumption and invariant to t 
due to part (a). Similarly, for the output estimation function 

9z, 




= Tp-g'ziP^Tp) = Tp./(1 + /"(z‘+ ).T*). 
Therefore, from the modified update of in ( |76l l, 
rfi = /"(z‘+i)./(l + /;(z‘+i).r‘), 


f'ib) > el, 

at any minima b = b to the linearized LSL-BFE when 
{Tr,Tp) G F. This proves Assumption [^b). The uniform 
boundedness of all the other derivatives follows from the fact 
that all the terms are twice differentiable and the set F is 
compact. 

Thus, all the conditions of Assumption and the theorem 
follows from Theorem [T] 


Appendix G 
Proof of Theorem|3 

We begin with proving part (a). We use induction. Suppose 
that ( |77| ) is satisfied for some t. Since q°, x°, and v° are fixed 
points, we have from line 10 of Algorithm that x° = v°. 
Then, since x° is a fixed point, we have from lines and 
and equation (|6^ that 


= argmin/2,(x) + 4||x- + r°.q°||2o. 

X ^ 


or equivalently, 

l./r« =T^p + l.//"(z*+i) = Sr^ +1. (148) 

I, ^ l.//"(z‘+')- (149) 

Now define the maps, 

$s(Ta;) := l./(Srx +4z) 

4’x(ts) := l./(SV5 + $,x) 
so that the updates ( |148| l and ( |147| i can be written as 

), r‘+i = d>x (r*). 

Note that, due to part (a), ^x and in \IA1) and ( |148| l, do 
not change with t. It is easy to check that, for any S > 0, 

(i) 4’s(Ta,) > 0, 

(ii) Ta; > r' ^ and 

(iii) For all a > 1, > {I/a)<^ s{tx)■ 

with the analogous properties being satisfied by <i>a;(Ts). Now 
let $ := $ 2 ; o <i)j; be the composition of the two functions so 
that Then, $ satisfies the three properties: 
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(i) $(x,) > 0 , 

(ii) and 

(iii) For all a > 1, ^{ar^) < a^{rx)- 

Also, for any Tg > 0, we have $a;(Ts) < l./^x, and therefore, 
^’(T-a;) < l./lx for all Fc ^ 0- Hence, taking any 'Tx ^ 
we obtain: 


Tx > ^{Tx)- 


The results in then show that the updates 
converge to unique fixed points. Since the increment increases 
by two, we need to apply the convergence twice: once for the 
with odd values of t, and a second time for even values. 
Since the limit points are unique, both the even and odd sub¬ 
sequences will converge to the same value. A similar argument 
shows that also converges to unique fixed points. 


Appendix H 

Original GAMP via Stale, Linearized ADMM 


First, we examine the minimization in ( |79b| l. Starting with 
a derivation identical to ( |4^ , but with = E(x|&^^) 
in place of v, yields 


L{b*+^,b^,s*-Tp) 

= J{b*^^,b^,Tr,Tp) + (s*)'''(E(z|&^) - Ax*+^) 

-f f||E(z|6^) - Ax‘+if (150) 

^ M WTp 

= D{b^\\Z~^e~f^) + (l./(2Tp))^var(z|62) (151) 


+ E(i|! 
= D{b,\\Z 


Z — (Ax*"*"^ — Tp.S* 


\bz) + const, 

i=i 




e -'rp.s*)|| 2 j 6 ^) 


-f const, 

= / &z(z)ln- 7 ——r 

exp(-/,(z)-i 

+ const 

= 0 ( 6211 ^ 2 ) + const. 


l|z-(Ax*+i-rp.s*)||2J 


dz 


(152) 

(153) 


where “const” is constant with respect to b^ and Pz(z) c>c 
exp(—/^(z) —f ||z—(Ax*+^—Xp.s‘)||^^). Thus, the minimizing 
density 6 ^ output by (|79b|l is 


bf\z) oc exp ( - /z(z) - f ||z - p*+^||^J (154) 

p‘+i = Ax*+i - Tp.sK (155) 


Next we examine the minimization in ( |79a[ ). The objective 
function in (|79a|i can be written, using (|78]l, x* = E(x|&J,), 


and ( [ 3 T] i, as follows: 

^ 2 :; S , Tp) 

+ f (E(x|6,) - x‘)^(D.,.^ - A^D^^A) (E(x|&,) - x‘) 

= D{bx\\e~f^) + (l./xr)'''var(x|62,) - (s‘“^)'''AE(x|6a;) 

+ fE(x|6,)TATD^^AE(x|6,) - (z‘)Td^^AE(x|6,) 

+ fE(x|6,)T(Dz.^ - aTDz.^A)E(x|6,) 

— (x*)^(Dt-,, — A^Dt-pA)E( x|6a;) -f const 

= D{bx\\e~f^) + (l./xr)'''var(x|63,) - (s‘“^)'''AE(x|6a;) 

— (z*)TD^pAE(x| 5 ,) + iE(x|6,)TDz.pE(x|6,) 

— (x*)'''DT-pE(x|6a;) + (x*)'''a'''D.t-p AE(x|6a;) + const 
= D{bx\\e~^^) + (l./xr)'''var(x|63,) - (s‘)'''AE(x|6a;) 

+ f E(x|6a;)'''DT-pE(x|63,) — (x‘)'''Dz-pE(x| 5 a;) + const 

= D{bx\\e~^’^) + (l./Tr)'''var(x|&2,) 

+ f E(x|6a;)'''Dz-pE(x|63;) — (r*)'''DT-pE(x|6a;) -f const 
= D{bx\\e~^^) + (l./xr)'''var(x|62,) + f ||E(x|&2;) - r‘||^^ 
+ const 

= L)(6a;||e“A) _|_E(i||x-r‘||^^|&2,) + const, ( 156 ) 

= / 6a;(x)ln— . f -nil—: dx + const ( 157 ) 

= 0(6,1,11^2,) + const, ( 158 ) 

where “const” is constant with respect to bx\ line (a) used 
( | 79 c| i; line (b) used 

r* = X* + Diag(T,.)A'^s‘; ( 159 ) 

and line (c) used Px{^) oc exp{—fx{x) — ^Ijx —r*||^^). Thus, 
the minimizing density bx output by ( | 79 a[ ) is 

6*+^(x) cx exp{- fx{x) - i||x-r‘||2J. ( 160 ) 

Finally, using ( | 155 | l in ( | 79 c| i, we obtain 

s‘+i = (z‘+l - P*+ 1 )./Tp. ( 161 ) 

Thus, we have recovered the mean updates of the original 
sum-product GAMP algorithm, i.e., the non-indented lines in 
Algorithm 
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